OK, today we'll take the Y Combinator Blog for a small scraping test run. I promise you'll get a real sense of achievement XD
user@ubuntu:/NodeJS/tutorial$ source tutorial/bin/activate
(tutorial) user@ubuntu:/NodeJS/tutorial$
The first line activates the virtual environment; tutorial is the name of your virtual environment, and you can name it anything you like.
Haven't installed virtualenv yet? See the installation tutorial.
(tutorial) user@ubuntu:/NodeJS/tutorial$ python
Python 3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>>
If no error message shows up, the installation is OK.
How do you install scrapy inside the virtual environment? → Run pip install scrapy
Create a new file called articles_spider.py. In a Node.js project we usually put functions that will be required later into the /views folder, so your file structure will now look like this:
- bin/
----- www
- node_modules/
- public/
- routes/
----- index.js
- views/
----- error.ejs
----- index.ejs
+ --- articles_spider.py
- app.js
- package.json
- package-lock.json
- junkfood.json
import scrapy

class ArticlesSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'https://blog.ycombinator.com/',
    ]
    ...
    def parse(self, response):
You can select elements with either .css() or .xpath(); I'll demonstrate with .css() here. Everything we want appears inside the .loop-section class, and each piece can be found at the following locations:

<a class="article-title" href="link_here">Title</a>
<a class="author url fn">Author Name</a>
<ul class="post-categories"><li><a>Tags</a></li></ul>

We write these as selectors inside the parse method we just created:
...
    'title': response.css('a.article-title::text').extract_first(),
    'link': response.css('a.article-title::attr("href")').extract_first(),
    'author': response.css('a.author::text').extract_first(),
    'tags': response.css('ul.post-categories > li a::text').extract()
Note 1: response.css('ul > li') selects every li that is a direct child of a ul.
Note 2: ::text grabs only the plain-text content, while ::attr("href") grabs the URL inside the href attribute.
Note 3: extract_first() takes the first item of the matched list; if you don't say which match you want, the selector returns the whole list of HTML objects.
Note 4: response is the whole page content handed to our callback, which is why we run the selects on it.
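If you want to verify these selectors before committing them to code, scrapy shell lets you test them interactively. A quick sketch (the exact output depends on whatever is on the blog's front page when you run it):

(tutorial) user@ubuntu:/NodeJS/tutorial/views$ scrapy shell 'https://blog.ycombinator.com/'
>>> article = response.css('div.loop-section')[0]  # first article block on the page
>>> article.css('a.article-title::text').extract_first()  # first match only
'New Year’s Buying Guide'
>>> article.css('ul.post-categories > li a::text').extract()  # all matches, as a list
['Lists', 'YC News']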
Since each page lists many articles, we loop over every div.loop-section and yield one item per article:

...
    for article in response.css('div.loop-section'):
        yield {
            'title': article.css('a.article-title::text').extract_first(),
            'link': article.css('a.article-title::attr("href")').extract_first(),
            'author': article.css('a.author::text').extract_first(),
            'tags': article.css('ul.post-categories > li a::text').extract()
        }
Then, to crawl beyond page one, grab the "previous posts" link at the bottom of the page and recurse:

...
    next_page = response.css('div.nav-previous a::attr("href")').extract_first()
    if next_page is not None:
        yield response.follow(next_page, self.parse)
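One note on response.follow: it accepts relative URLs and builds the next request for you, but it only exists in Scrapy 1.4 and later. On an older version, an equivalent sketch with scrapy.Request would look like this:

    next_page = response.css('div.nav-previous a::attr("href")').extract_first()
    if next_page is not None:
        # response.urljoin resolves a relative href against the current page's URL
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)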
Putting it all together, the complete articles_spider.py looks like this:

import scrapy

class ArticlesSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'https://blog.ycombinator.com/',
    ]

    def parse(self, response):
        for article in response.css('div.loop-section'):
            yield {
                'title': article.css('a.article-title::text').extract_first(),
                'link': article.css('a.article-title::attr("href")').extract_first(),
                'author': article.css('a.author::text').extract_first(),
                'tags': article.css('ul.post-categories > li a::text').extract()
            }
        next_page = response.css('div.nav-previous a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
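As an aside, you don't have to go through the scrapy CLI shown below; Scrapy's CrawlerProcess can launch the spider from a plain Python script. A minimal sketch, assuming the code above is saved as articles_spider.py in the same folder:

# run_spider.py -- a sketch; the FEED_* settings mirror the -o flag used below
from scrapy.crawler import CrawlerProcess
from articles_spider import ArticlesSpider

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'articles.json',
})
process.crawl(ArticlesSpider)
process.start()  # blocks until the whole crawl finishes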
Now run the spider from the views folder and export the results to articles.json:

(tutorial) user@ubuntu:/NodeJS/tutorial/views$ scrapy runspider articles_spider.py -o articles.json

It will run for ten-odd minutes, because the Y Combinator Blog has 294 pages of post listings in total. Open articles.json and you'll see one big, dense index:
[
{"title": "New Year\u2019s Buying Guide", "link": "https://blog.ycombinator.com/b2b-buying-guide/", "author": "Sharon Pope", "tags": ["Lists", "YC News"]},
{"title": "Y Combinator Female Founders Conference 2018", "link": "https://blog.ycombinator.com/y-combinator-female-founders-conference-2018/", "author": "Kat Ma\u00f1alac", "tags": ["Female Founders", "YC News"]},
{"title": "YC Alumni Who Paid It Forward", "link": "https://blog.ycombinator.com/yc-alumni-who-paid-it-forward/", "author": "Michael Seibel", "tags": ["Founder Stories"]},
...
]
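For a quick sanity check on the result, a few lines of Python will do (a sketch; it assumes you run it in the same folder where articles.json was written):

import json

# load the exported feed and print a quick summary
with open('articles.json') as f:
    articles = json.load(f)

print('scraped', len(articles), 'articles')
print('first one:', articles[0]['title'], '-', articles[0]['author'])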
A friendly reminder: don't assume that knowing this much means you know web scraping. The Y Combinator Blog has a very tidy structure, no login wall, and no ajax-loaded content, which is the only reason this first scraping experience was so painless (Y)
In the next post we'll demonstrate, in a Jupyter notebook, how to simulate a login with Python when a site only shows its full data after you sign in.
After that: how do you crack a site that loads its data dynamically via ajax? (That one also happens to be the source of the keyword-phrase recommendations for this little project.)
Finally we'll reach selenium and xvfb: what do you do when the JSON file doesn't even show up in the network panel?
happy scrapy!